NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

MixRT: Mixed Neural Representations For Real-Time NeRF Rendering

https://doi.org/10.1109/3DV62453.2024.00087

Li, Chaojian; Wu, Bichen; Vajda, Peter; Lin, Yingyan Celine (March 2024, The 11th International Conference on 3D Vision)

Neural Radiance Field (NeRF) has emerged as a leading technique for novel view synthesis, owing to its impressive photorealistic reconstruction and rendering capability. Nevertheless, achieving real-time NeRF rendering in large-scale scenes has presented challenges, often leading to the adoption of either intricate baked mesh representations with a substantial number of triangles or resource-intensive ray marching in baked representations. We challenge these conventions, observing that high-quality geometry, represented by meshes with substantial triangles, is not necessary for achieving photorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF representation that includes a low-quality mesh, a view-dependent displacement map, and a compressed NeRF model. This design effectively harnesses the capabilities of existing graphics hardware, thus enabling real-time NeRF rendering on edge devices. Leveraging a highly-optimized WebGL-based rendering framework, our proposed MixRT attains real-time rendering speeds on edge devices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop), better rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360 datasets), and a smaller storage size (less than 80% compared to state-of-the-art methods).
more » « less
Full Text Available
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

https://doi.org/10.1109/CVPR52733.2024.00784

Liang, Feng; Wu, Bichen; Wang, Jialiang; Yu, Licheng; Li, Kunpeng; Zhao, Yinan; Misra, Ishan; Huang, Jia-Bin; Zhang, Peizhao; Vajda, Peter; et al (June 2024, IEEE)

Full Text Available
An Investigation on Hardware-Aware Vision Transformer Scaling

https://doi.org/10.1145/3611387

Li, Chaojian; Kim, Kyungmin; Wu, Bichen; Zhang, Peizhao; Zhang, Hang; Dai, Xiaoliang; Vajda, Peter; Lin, Yingyan (August 2023, ACM Transactions on Embedded Computing Systems)

Vision Transformer (ViT) has demonstrated promising performance in various computer vision tasks, and recently attracted a lot of research attention. Many recent works have focused on proposing new architectures to improve ViT and deploying it into real-world applications. However, little effort has been made to analyze and understand ViT’s architecture design space and its implication of hardware-cost on different devices. In this work, by simply scaling ViT’s depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve comparable or superior accuracy-efficiency trade-off than most of the latest ViT variants. Specifically, compared to DeiT-Tiny, our scaled model achieves a\(\uparrow 1.9\% \)higher ImageNet top-1 accuracy under the same FLOPs and a\(\uparrow 3.7\% \)better ImageNet top-1 accuracy under the same latency on an NVIDIA Edge GPU TX2. Motivated by this, we further investigate the extracted scaling strategies from the following two aspects: (1) “can these scaling strategies be transferred across different real hardware devices?”; and (2) “can these scaling strategies be transferred to different ViT variants and tasks?”. For (1), our exploration, based on various devices with different resource budgets, indicates that the transferability effectiveness depends on the underlying device together with its corresponding deployment tool; for (2), we validate the effective transferability of the aforementioned scaling strategies obtained from a vanilla ViT model on top of an image classification task to the PiT model, a strong ViT variant targeting efficiency, as well as object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies lead to a boosted ImageNet top-1 accuracy of from\(74.6\% \)to\(76.7\% \)(\(\uparrow 2.1\% \)) under the same 0.7G FLOPs; and when transferred to the COCO object detection task, the average precision is boosted by\(\uparrow 0.7\% \)under a similar throughput on a V100 GPU.
more » « less
Full Text Available
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

https://doi.org/10.1109/CVPR52729.2023.01387

You, Haoran; Xiong, Yunyang; Dai, Xiaoliang; Wu, Bichen; Zhang, Peizhao; Fan, Haoqi; Vajda, Peter; Lin, Yingyan Celine (June 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Vision Transformers (ViTs) have shown impressive per-formance but still require a high computation cost as compared to convolutional neural networks (CNNs), one rea-son is that ViTs' attention measures global similarities and thus has a quadratic complexity with the number of in-put tokens. Existing efficient ViTs adopt local attention or linear attention, which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling- ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during inference. Our Castling- ViT leverages angular ker-nels to measure the similarities between queries and keys via spectral angles. And we further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an aux-iliary masked softmax attention to help learn global and lo-cal information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during inference. Extensive experiments validate the effectiveness of our Castling- ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on classification and 1.2 higher mAP on detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based at-tentions. Project page is available at here.
more » « less
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

https://doi.org/10.1109/CVPR52729.2023.00682

Liang, Feng; Wu, Bichen; Dai, Xiaoliang; Li, Kunpeng; Zhao, Yinan; Zhang, Hang; Zhang, Peizhao; Vajda, Peter; Marculescu, Diana (June 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions

https://doi.org/10.1109/CVPR42600.2020.01298

Wan, Alvin; Dai, Xiaoliang; Zhang, Peizhao; He, Zijian; Tian, Yuandong; Xie, Saining; Wu, Bichen; Yu, Matthew; Xu, Tao; Chen, Kan; et al (June 2020, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
null (Ed.)
Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to other search methods', since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9% higher accuracy, 15% fewer FLOPs than MobileNetV3-Small; and with similar accuracy but 20% fewer FLOPs than Efficient-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with equivalent model size. FBNetV2 models are open-sourced at https://github.com/facebookresearch/mobile-vision.
more » « less
Full Text Available
A LiDAR Point Cloud Generator: from a Virtual World to Autonomous Driving

https://doi.org/10.1145/3206025.3206080

Yue, Xiangyu; Wu, Bichen; Seshia, Sanjit A.; Keutzer, Kurt; Sangiovanni-Vincentelli, Alberto L. (June 2018, ICMR '18 Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval)

3D LiDAR scanners are playing an increasingly important role in autonomous driving as they can generate depth information of the environment. However, creating large 3D LiDAR point cloud datasets with point-level labels requires a significant amount of manual annotation. This jeopardizes the efficient development of supervised deep learning algorithms which are often data-hungry. We present a framework to rapidly create point clouds with accurate pointlevel labels from a computer game. To our best knowledge, this is the first publication on LiDAR point cloud simulation framework for autonomous driving. The framework supports data collection from both auto-driving scenes and user-configured scenes. Point clouds from auto-driving scenes can be used as training data for deep learning algorithms, while point clouds from user-configured scenes can be used to systematically test the vulnerability of a neural network, and use the falsifying examples to make the neural network more robust through retraining. In addition, the scene images can be captured simultaneously in order for sensor fusion tasks, with a method proposed to do automatic registration between the point clouds and captured scene images. We show a significant improvement in accuracy (+9%) in point cloud segmentation by augmenting the training dataset with the generated synthesized data. Our experiments also show by testing and retraining the network using point clouds from user-configured scenes, the weakness/blind spots of the neural network can be fixed.
more » « less
Full Text Available
SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving

Wu, Bichen; Iandola, Forrest; Jin, Peter H; Keutzer, Kurt (December 2016, arXiv.org)

Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires realtime inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI [9] benchmark. The source code of SqueezeDet is open-source released1.
more » « less
Full Text Available

Search for: All records